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An Online Parallel and Distributed Algorithm for 
Recursive Estimation of Sparse Signals 

Yang Yang, Mengyi Zhang, Marius Pesavento, and Daniel P. Palomar 


Abstract —In this paper, we consider a recursive estimation 
problem for linear regression where the signal to be estimated 
admits a sparse representation and measurement samples are 
only sequentially available. We propose a convergent parallel 
estimation scheme that consists in solving a sequence of ii- 
regularized least-square problems approximately. The proposed 
scheme is novel in three aspects: i) all elements of the unknown 
vector variable are updated in parallel at each time instance, and 
convergence speed is much faster than state-of-the-art schemes 
which update the elements sequentially; ii) both the update 
direction and stepsize of each element have simple closed-form 
expressions, so the algorithm is suitable for online (real-time) 
implementation; and iii) the stepsize is designed to accelerate the 
convergence but it does not suffer from the common trouble of 
parameter tuning in literature. Both centralized and distributed 
implementation schemes are discussed. The attractive features of 
the proposed algorithm are also numerically consolidated. 


I. Introduction 

Signal estimation has been a fundamental problem in a 
number of scenarios, such as wireless sensor networks (WSN) 
and cognitive radio (CR). WSN has received a lot of attention 
and is found to be useful in diverse disciplines such as environ¬ 
mental monitoring, smart grid, and wireless communications 
ifltl . CR appears as an enabling technique for flexible and 
efficient use of the radio spectrum |^, since it allows 
the unlicensed secondary users (SUs) to access the spectrum 
provided that the licensed primary users (PUs) are idle, and/or 
the interference generated by the SUs to the PUs is below a 
certain level that is tolerable for the PUs 1111. 

One in CR systems is the ability to obtain a precise estimate 
of the PUs’ power distribution map so that the SUs can avoid 
the areas in which the PUs are actively transmitting. This is 
usually realized through the estimation of th^osition, transmit 
status, and/or transmit power of PUs Oa III H Im], and such an 
estimation is typically obtained based on the minimum mean- 
square-error (MMSE) criterion mini min Hi [I. 
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The MMSE approach involves the calculation of the ex¬ 
pectation of a squared f 2 -norm function that depends on the 
so-called regression vector and measurement output, both of 
which are random variables. This is essentially a stochastic 
optimization problem, but when the statistics of these ran¬ 
dom variables are unknown, it is impossible to calculate the 
expectation analytically. An alternative is to use the sample 
average function, constructed from the sequentially available 
measurements, as an approximation of the expectation, and 
this leads to the well-known recursive least-square (RES) 
algorithm OIH [H As the measurements are available 
sequentially, at each time instance of the RES algorithm, an LS 
problem has to be solved, which furthermore admits a closed- 
form solution and thus can efficiently be computed. More 
details can be found in standard textbooks such as HQ. 

In practice, the signal to be estimated may be sparse in 
nature llldi [Zl [iSl Isl [ill ■ In a recent attempt to apply the RES 
approach to estimate a sparse signal, a regularization function 
in terms of fi-norm was incorporated into the LS function to 


encourage sparse estimates Ill4 ll[]. leading to an -regularized 


LS problem which has the form of the least-absolute shrinkage 
and selection operator (LASSO) lllfil] . Then in the recursive 
estimation of a sparse signal, the only difference from standard 
RES is that at each time instance, instead of solving an LS 
problem as in RES, an £i-regularized LS problem in the form 
of LASSO is solved 111]. 

However, a closed-form solution to the -regularized LS 
problem no longer exists because of the £i-norm regularization 
function and the problem can only be solved iteratively. As a 
matter of fact, iterative algorithms to solve the .fi-regularized 
LS problems have been the center of extensive research in re¬ 
cent wars and a number of solvers have been developed, e.g., 
GP iill, ll_ls mt], FISTA IH, ADMM and FLEXA 
112ill . Since the measurements are sequentially available, and 
with each measurement, a new -regularized LS problem is 
formed and solved, the overall complexity of using solvers for 
the whole sequence of -regularized LS problems is no longer 
affordable. If the environment is furthermore fast changing, 
this method is not even real-time applicable because new 
samples may have already arrived before the old -regularized 
LS problem is solved. 

To make the estimation scheme suitable for online (real- 
time) implementation, a sequential algorithm was proposed 
in Oldll . in which the fi-regularized LS problem at each 
time instance is solved only approximately. In particular, at 
each time instance, the £i-regularized LS problem is solved 
with respect to (w.r.t.) only a single element of the unknown 
vector variable (instead of all elements as in a solver) while 
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remaining elements are fixed, and the element is updated in 
closed-form based on the so-called soft-thresholding operator 
S. After a new sample arrives, a new -regularized LS 
problem is formed and solved w.r.t. the next element while 
remaining elements are hxed. This sequential update rule is 
known in literature as block coordinate descent method 11221] . 
To our best knowledge, il4|] is the only work on online 
algorithms for recursive estimation of sparse signals. 

Intuitively, since only a single element is updated at each 
time instance, the online algorithm proposed in jldt] sometimes 
suffers from slow convergence, especially when the signal has 
a large dimension while large dimension of sparse signals is 
universal in practice. It is tempting to use the parallel algorithm 
proposed in imi, but it works for deterministic optimiza¬ 
tion problems only and may not converge for the stochastic 
optimization problem at hand. Besides, its convergence speed 
heavily depends on the stepsize. Typical stepsizes are Armijo- 
like successive line search, constant stepsize, and diminishing 
stepsize. The former two suffer from high complexity and 


slow convergence 112 iL Remark 4], while the decay rate of 


the diminishing stepsize is very difficult to choose: on the one 
hand, a slowly decaying stepsize is preferable to make notable 
progress and to achieve satisfactory convergence speed; on the 
other hand, theoretical convergence is guaranteed only when 
the stepsizes decays fast enough. It is a difficult task on its 
own to find the decay rate that gives a good trade-off. 

A recent work on parallel algorithms for stochastic opti¬ 
mization is 11241] . However, the algorithms proposed in 12411 are 
not applicable for the recursive estimation of sparse signals. 
This is because the regularization function in 0 must be 
strongly convex and differentiable while the regularization 
gain must be lower bounded by some positive constant so that 
convergence can be achieved, but the regularization function 
in terms of £i-norm in this paper is convex (but not strongly 
convex) and nonsmooth while the regularization gain is de¬ 
creasing to 0. 

In this paper, we propose an online parallel algorithm 
with provable convergence for recursive estimation of sparse 
signals. In particular, our contributions are as follows: 

1) At each time instance, the £ 1 -regularized LS problem is 
solved approximately and all elements are updated in parallel, 
so the convergence speed is greatly enhanced compared with 
M- As a nontrivial extension of il4|] from sequential update 
to parallel update and MM from deterministic optimization 
problems to stochastic optimization problems, the convergence 
of the proposed algorithm is established. 

2) The proposed stepsize is based on the so-called minimiza¬ 
tion rule (also known as exact line search) and its benefits are 
twofold: hrstly, it guarantees the convergence of the proposed 
algorithm, which may however diverge under other stepsize 
rules; secondly, notable progress is achieved after each variable 
update and the trouble of parameter tuning in 1123112 111 is saved. 
Besides, both the update direction and stepsize of each element 
have a simple closed-form expression, so the algorithm is fast 
to converge and suitable for online implementation. 

3) When implemented in a distributed manner, for example 
in CR networks, the proposed algorithm has a much smaller 
signaling overhead than in state-of-the-art techniques QH. 


Besides, the estimates of different SUs are always the same 
and they are based on the global information of all SUs. 
Co^ared with consensus-based distributed implementations 
ui liE where each SU makes individual estimate decision 
mainly based on his own local information and all SUs con¬ 
verge to the same estimate only asymptotically, the proposed 
approach can better protect the Quality-of-Service (QoS) of 
PUs because it eliminates the possibility that the estimates 
maintained by different SUs lead to conflicting interests of 
the PUs (i.e., some may correctly detect the presence of PUs 
but some may not in consensus-based algorithms). 

The rest of the paper is organized as follows. In Section 
El we introduce the system model and formulate the recursive 
estimation problem. The online parallel algorithm is proposed 
in Section |III1 and its implementations and extensions are 
discussed in Section HV] The performance of the proposed 
algorithm is evaluated numerically in Section |V] and finally 
concluding remarks are drawn in Section [Vl] 

Notation: We use x, x and X to denote scalar, vector and 
matrix, respectively. Xjk is the {j, A:)-th element of X; Xk 
and Xj^k is the fc-th element of x and xj, respectively, and 
X = {xk)^=i and Xj = {xj^k)k=i- ‘l(X) is a vector that 
consists of the diagonal elements of X. x o y denotes the 
Hadamard product between x and y. [x]^ denotes the element¬ 
wise projection of x onto [a,b]: [x]^ = max(min(x, b),a), 
and [x]’*’ denotes the element-wise projection of x onto the 
nonnegative orthant: [x]’*’ = max(x, 0). X^ denotes the 
Moore-Penrose inverse of X. 


II. System Model and Problem Formulation 

Suppose X* = G K.^ is a deterministic sparse 

signal to be estimated based on the the measurement G R, 
and they are connected through a linear regression model: 

2/n = -hu„, n=l,...,N, (1) 

where N is the number of measurements at any time instance. 
The regression vector g„ = {gn,k)k=i ^ i^ assumed 
to be known, and G R is the additive estimation noise. 
Throughout the paper, we make the following assumptions on 
g„ and v„ forn = 1,..., N: 

(Al) gn are independently and identically distributed (i.i.d.) 
random variables with a bounded positive definite co- 
variance matrix; 

(A2) Vn are i.i.d. random variables with zero mean and 
bounded variance, and are uncorrelated with gn- 

Given the linear model in O, the problem is to estimate 
X* from the set of regression vectors and measurements 
{Sn,yn}n=i- Since both the regression vector g„ and esti¬ 
mation noise Vn are random variables, the measurement is 
also random. A fundamental approach to estimate x* is based 
on the MMSE criterion, which has a solid root in adaptive 
filter theory MM- To improve the estimation precision, all 
available measurements {Sn,yn}n=i exploited to form a 
cooperative estimation problem which consists in finding the 
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variable that minimizes the mean-square-error |25 MW- 


= arg mm 


E 


■ N 

E 

.n—1 


{yn-sl^Y 


= arg mm — x' 

X 2 


Gx — X, 


where G = J2n=i [SnSn] and b = J2n=i ® bjnSn 
the expectation is taken over {gn,J/n}^=i- 

In practice, the statistics of {gntyn}n=i are often not 
available to compute G and b analytically. In fact, the ab¬ 
sence of statistical information is a general rule rather than 
an exception. It is a common approach to approximate the 
expectation in Q by the sample average constructed from the 
samples {g,Y'\ yn'YT=i sequentially available up to time t 

iGl: 


( 2 ) 


and 


’g(*)x-( b(*))^x (3a) 

(3b) 

where G^*^ and b^*) is the sample average of G and b, 
respectively: 


(t) A . -L ' 

^ris = arg mm -x 


G(‘) 4 i 


t N 

r—ln—1 


N 




(4) 

and is the Moore-Penrose pseudo-inverse of A. In litera¬ 
ture, (O is known as recursive least square (RLS), as indicated 
by the subscript “rls”, and can be computed efficiently in 
closed-form, cf. Obi) . 

In many practical applications, the unknown signal x* is 
sparse by nature or by design, but give n by Q is not 
necessarily sparse when t is finite jlbl IlSIl . To overcome 
this shortcoming, a sparsity encouraging function in terms of 
£i-norm is incorporated into the sample average function in 
©, leading to the following fi-regularized sample average 
function at any time instance t = 1,2,... 014 ui[it]: 

1 
2 


L(‘)(x) 4 -x^G(*)x- (b(*))^x- 


||x|| 


1 ’ 


(5) 


where > 0. Define 
lW(x): 

Xiaslo = arg min 


as the minimizing variable of 


lW(> 


f = 1,2. 


( 6 ) 


In literature, problem (|6]) for any fixed t is known as the least- 
absolute shrinkage and selection operator (LASSO) EEi 
(as indicated by the subscript “lasso” in (|6l)). Note that in 
batch processing CSlIi, problem (|6| is solved only once 
when a certain number of measurements are collected (so t is 
equal to the number of measurements), while in the recursive 
estimation of x*, the measurements are sequentially available 
(so t is increasing) and (|6]) is solved repeatedly at each time 
instance f = 1, 2,... 

The advantage of (|6| over (|2|), whose objective function 
is stochastic and whose calculation depends on unknown 
parameters G and b, is that (|6]) is a sequence of determin¬ 
istic optimization problems whose theoretical and algorith¬ 
mic properties have been extensively investigated and widely 


understood. A natural question arises in this context: is 
equivalent to (|2]) in the sense that is a strongly consistent 
estimator of x*, i.e., lim(_>.oo = x* with probability 1? 
The connection between in (|6l) and the unknown variable 
X* is given in the following lemma llldll . 


Lemma 1. Suppose Assumptions (A1)-(A2) as well as the 
following assumption are satisfied for 

(A3) is a positive sequence converging to 0, i.e., 

p,^*'> > 0 and limt_,.oo = 0. 

Then limt_,.oo = x* with probability 1. 


An example of p!3'> satisfying Assumption (A3) is = 
a/t^ with a > 0 and /3 > 0. Typical choices of /3 are /? = 1 
and = 0.5 H . 

Lemma [T] not only states the connection between and 
X* from a theoretical perspective, but also suggests a simple 
algorithmic solution for problem (O: x* can be estimated by 
solving a sequence of deterministic optimization problems (|6]), 
one for each time instance t = 1,2,.... However, different 
from RLS in which each update has a closed-form expression, 
cf. Obl l. problem (|6]) does not have a closed-form solution 
and it can only be solved numerically Iw iterative algorithm 


such as GP 


ADMM 120], and 


o nly be solved numerically by ite 

iEj, ii_1s iH, FISTA 111], _ 

FLEXA lElll . As a result, solving (|6l) repeatedly at each time 
instance t = 1,2,... is neither computationally practical nor 
real-time applicable. The aim of the following sections is to 
develop an algorithm that enjoys easy implementation and fast 
convergence. 


III. The Online Parallel Algorithm 

The LASSO problem in (|6]) is convex, but the objective 
function is nondifferentiable and it cannot be minimized in 
closed-form, so solving (|6l) completely w.r.t. all elements of x 
by a solver at each time instance t — 1,2,... is neither com¬ 
putationally practical nor suitable for online implementation. 
To reduce the complexity of the variable update, an algorithm 
based on inexact optimization is proposed in Cl : at time 
instance t, only a single element Xk with k = mod(f—1, A)-|-l 
is updated by its so-called best response, i.e., L(*)(x) is 
minimized w.r.t. Xk only: = argminL(*)(xfc,x^],) 

with x_fc = which can be solved in closed-form, 

while the remaining elements remain unchanged, i.e., 

x^^^^ = x^^. At the next instance f-l-1, a new sample average 
function is formed with newly arriving samples, 

and the (fc -F l)-th element, x^+i, is updated by minimizing 
^(t+i)(x) vv.r.t. Xfe+i only, while the remaining elements again 
are fixed. Although easy to implement, sequential updating 
schemes update only a single element at each time instance 
and they sometimes suffer from slow convergence when the 
number of elements K is large. 

To overcome the slow convergence of the sequential update, 
we propose an online parallel update scheme, with provable 
convergence, in which (|6]) is solved approximately by simul¬ 
taneously updating all elements only once based on their 
individual best response. Given the current estimate x*^*^ which 
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is available before the t-th sample arrive^ the next estimate 
x(t+i) js determined based on all the samples collected up to 
instance f in a three-step procedure as described next. 

Step 1 (Update Direction): In this step, all elements of 
X are updated in parallel and the update direction of x at 
X = x(‘\ denoted as x*^*) — x(*\ is determined based on the 
best-response For each element of x, say Xk, its best 
response at x = x^*) is given by; 


-(t) A 


x'l,' = argmm 

Xk 


+ \c^k\xk - 


V/fc, 

(7) 

where x_f. = {xj}j^k and it is fixed to their values of the 
preceding time instance x_fc = An additional quadratic 
proximal term with Cfe Qj? included in dTji for numerical 


simplicity and stability | 22 , because it plays an important 
role in the convergence analysis of the proposed algorithm; 
conceptually it is a penalty (with variable weight for 
moving away from the current estimate x^^\ 

After substituting (|5]) into (|7]), the best-response in (|7]l can 
be expressed in closed-form; 


x(t) 


' lp(‘) 2 _ (t) 

1.1.x i. I i. 


= arg mm ■ 

Xk 


( (t) 

i^k 


S, 


'kk-^k 

+p,^*^\xk\ + ^c^k\xk - 


(t)f 


0 


^kk 


.(*) 

-k 


where 


and 


(t) A At) 


= K’-l^G\,x^ 


k = l,...,K, 




( 8 ) 


(9) 


Saib) = (b- a)+ - (-& - a)+ 

is the well-known soft-thresholding operator MM- From 


the definition of in (IHi, ^ 0 and > 0 for all 
k, so the division in ® is well-defined. 

Given the update direction x*^*^ — an intermediate 
update vector x(*)( 7 ) is defined; 

x(‘)( 7 ) =x(‘)-b 7 (x(*)-x(*)), ( 10 ) 

where x^*) = iXk'^)k=i and 7 S [0,1] is the stepsize. The 
update direction x*^*^ — x^*^ is a descent direction of L^‘^(x) 
in the sense specified by the following proposition. 

Proposition 2 (Descent Direction). For x^*) = ixl^'^)k=i 

given in ([Sll and the update direction x*^*) — the following 

holds for any 7 G [ 0 , 1 ].' 




(0 


< -7 c, 


where cl*),, = min^ 


1 \ 

2 ^n 


ft) 


ft) 


ft) 


( 11 ) 


{gS + 4‘)}>o. 


Proof: The proof follows the general line of arguments 


in 112ll Prop. 8 (c)] and is thus omitted here. ■ 

Step 2 (Stepsize): In this step, the stepsize 7 in (fTOl i is 
determined so that fast convergence is observed. It is easy to 


’xO) could be arbitrarily chosen, e.g., xO) = 0. 


see from (fTTT i that for sufficiently small 7 , the right hand side 
of (fTTI) becomes negative and L^) (x) decreases as compared 
to X = x)t). Thus, to minimize L(*1 (x), a natural choice of 
the stepsize rule is the so-called “minimization rule” lE^ Sec. 
2.2.1] (also known as the “exact line search” 11281 Sec. 9.2]), 
which is the stepsize, denoted as 7 op(, that decreases l 1 * 1 (x) 
to the largest extent along the direction — x^) at x = x^): 

7ip| = argmin {l(‘)(xW( 7 )) - l(‘)(xW)} 

0 < 7<1 

= argmin + y{x^*'> - x^*!)) - 

0 < 7<1 


= arg miiK 
0 < 7<1 


i(x(t) _x(‘))'^G(*)(x(*) -xW) 

+ (G(*)x(‘) - b(‘))^(x(*) - x(*)) • 7 
^-|-/rl‘l(||x(*l -I-7(xl*l — x(*l)||p — 



Therefore by definition of we have for any 7 G [0, Ij; 

2,(i)(xW +7(‘)(xW _xl‘))) < L(‘1 (xW -b7(x(‘l -x(‘l)). 

( 13 ) 

The difficulty with the standard minimization rule (fT^ is the 
complexity of solving the optimization problem in (fT^ . since 
the presence of the ^i-norm makes it impossible to find a 
closed-form solution and the problem in (fTSI) can only be 
solved numerically by a solver such as SeDuMi lE^ . 

To obtain a stepsize with a good trade off between con¬ 
vergence speed and computational complexity, we propose a 
simplified minimization rule which yields fast convergence but 
can be computed at a low complexity. Firstly it follows from 
the convexity of norm functions that for any 7 G [0, Ij; 

p(*l(||x(*l + 7 (x(‘) -x(‘))||p - ||x(*l||p) 

= II (1 — 7)xl*^ -f 7X*'*^ 111 ~ 11 ^^*^ 111 

< (1 — y)pG) llx*'*! II p -f yp-G) ||x^*^ ||p — ||i (14a) 

= m(*1(||x(‘1||^-||x(*1||p)- 7. (14b) 

The right hand side of (I14bl) is linear in 7 , and equality is 
achieved in (I14al) either when 7 = 0 or 7 = 1. 

In the proposed simplified minimization rule, instead of 
directly minimizing L^) (^xf) [j)) — L^) (xW'j over 7 , its upper 
bound based on (fT4l i is minimized and yf) is given by 

f i(x(‘) - x(*))'^G(*)(x(*) - x(‘)) • 721 

y)t) A ai-gmin J -f (G*^*lxl‘l — — x^)) -71 

|^+^(t)(||x(t)||^_||x(*)||^).^ J 

( 15 ) 

The scalar problem in (ffSl) consists in a convex quadratic 
objective function along with a bound constraint and it has 
a closed-form solution, given by (fTbl l at the top of the next 
page, where [a;]J = min(max(a;, 0 ), 1 ) denotes the projection 
of X onto [0,1], and obtained by projecting the unconstrained 
optimal variable of the convex quadratic scalar problem in (fTSl) 
onto the interval [ 0 , 1 ]. 

The advantage of minimizing the upper bound function of 
L(*)(x(*)( 7 )) in (fTSl l is that the optimal 7 , denoted as yf), 
always has a closed-form expression, cf. (m. At the same 
time, it also yields a decrease in l 1 * 1 (x) at x = x^) as the 
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7 


it) = 


(GWx(‘) - -xW) + ^W(||x(*)||^- ||x(‘)||^) 

(x(*) — x(‘))^G(*)(x(*) — x(*)) 


(16) 


standard minimization rule 7 opt (fT^ does in (fOl l. and this 
decreasing property is stated in the following proposition. 

Proposition 3. Given and 7*^*1 defined in dTO]) and 

(O, respectively, the following holds: 

2,(‘)(i(t)(7(t))) < l(‘)(xW)^ 

and equality is achieved if and only if = 0 . 


Proof: Denote the objective function in (flSl l as 
2 ^ 1 ‘)(x(t)( 7 )), It follows from (O that 

i(*)(x(t)(7(t))) _ l(*)(x(*)) < r^*\i(*)(7W)), (17) 

and equality in (fTTl) is achieved when 7^*1 = 0 and 7 !*^ = 1 . 
Besides, it follows from the definition of 7 ^*! that 

r(‘)(xW(^W)) < L^*)(xW(7))|.^^o = L(‘)(xW). (18) 

Since the optimization problem in (flSl l has a unique optimal 
solution 7 !*! given by (fTST l. equality in ( fTST i is achieved if and 
only if 7 !*! = 0 . Finally, combining ( fTTl i and ( fTSl l yields the 
conclusion stated in the proposition. ■ 

The signaling required to perform (fTSl l (and also (0) will 
be discussed in Section HVl 

Step 3 (Dynamic Reset): In this step, the next estimate 
x(i+i) js defined based on xl*l( 7 l‘l) given in ( fTOb and ( fTSl l. 
We first remark that xl*l ( 7 !*^) is not necessarily the solution 
of the optimization problem in ([Sll, i.e., 

iW(xW(7(t))) > lW(x« J = mmLW(x). 


This is because x is updated only once from x = x* to x = 
yiit){jit)f which in general can be further improved unless 
x(t)(^(*)j already minimizes L(‘)(x), i.e., x(‘)( 7 W) = 

The definitions of and in (ISJ-tlSll reveal that 


0 = LW(x)|^^o^^^‘^(41).i=1.2,.... 

However, (x^*)( 7 ^*))) may be larger than 0 and x(‘^( 7 (‘)) 
is not necessarily better than the point 0. Therefore we define 
the next estimate to be the best between the two points 

x(*)( 7 (*)) and 0 ; 


x(*+i) = argmin 

, 0 } 


x(*)(7(*)), 

0 , 


if i(i)(i(i)( 7 (t))) < l(‘)( 0 ) = 0 , 
otherwise. 


(19) 


and it is straightforward to infer the following relationship 
among x(*+^^ and 

Moreover, the dynamic reset ( fT9l l guarantees that 

xl*+^) G {x : l1*)(x) < 0}, f = 1,2,..., (20) 


Algorithm 1: The Online Parallel Algorithm 

Initialization: x^^l = 0, f = 1. 

At each time instance f = 1, 2,...: 

Step 1: Calculate x^*! according to (Ull. 

Step 2: Calculate 7 ^*! according to ( fT^ . 

Step 3-1: Calculate according to ( fTOb . 

Step 3-2: Update xl*+^^ according to (fTOli. 


Since limt^oo >~ 0 and converges from Assumptions 
(A1)-(A2), (f20l l guarantees that {xl‘^} is a bounded sequence. 

To summarize the above development, the proposed online 
parallel algorithm is formally described in Algorithm [T] and 
its convergence properties are given in the following theorem. 

Theorem 4 (Strong Consistency). Suppose Assumptions 
(A1)-(A3) as well as the following assumptions are satisfied: 
(A4) Both g„ and have bounded moments; 

(A5) > c for some c > 0; 

(A6) The sequence is nonincreasing, i.e., > 

p(*+i). 

Then x^*'> is a strongly consistent estimator of x*, i.e., 
liiTLt^oo = X* with probability 1. 

Proof: See Appendix fAl ■ 

Assumption (A4) is standard on random variables and is 
usually satisfied in practice. We can see from Assumption (A5) 
that if there already exists some c > 0 such that > c for 
all t, the quadratic proximal term in (|7]i is no longer needed, 
i.e., we can set = 0 without affecting convergence. This is 
the case when t is sufficiently large because limt^oo 0. 

In practice it may be difficult to decide if t is large enough, 
so we can just assign a small value to for all t in order 
to guarantee the convergence. As for Assumption (A6), it is 
satisfied by the previously mentioned choices of p^*^, e.g., 
pit) = a/t^ with a > 0 and 0.5 </)<!. 

Theorem |4] establishes that there is no loss of strong 
consistency if at each time instance, (|6ll is solved only ap¬ 
proximately by updating all elements simultaneously based 
on best-response only once. In what follows, we comment on 
some of the desirable features of Algorithm [T] that make it 
appealing in practice: 

i) Algorithm [T] belongs to the class of parallel algorithms 
where all elements are updated simultaneously each time. 
Compared with sequential algorithms where only one element 
is updated at each time instance Q, the improvement in 
convergence speed is notable, especially when the signal 
dimension is large. 

ii) Algorithm[T]is easy to implement and suitable for online 
implementation, since both the computations of the best- 
response and the stepsize have closed-form expressions. With 
the simplified minimization stepsize rule, notable decrease in 
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objective function value is achieved after each variable update, 
and the trouble of tuning the decay rate of the diminishing 
stepsize as required in i23n is also saved. Most importantly, 
the algorithm may not converge under decreasing stepsizes. 

iii) Algorithm [1] converges under milder assumptions than 
state-of-the-art algorithms. The regression vector g„ and noise 
Vn do not need to be uniformly bounded, which is required 
in EH and which is not satisfied in case of unbounded 
distribution, e.g., the Gaussian distribution. 


IV. Implementation and Extensions 
A. A special case: x* > 0 

The proposed Algorithm [T] can be further simplified if 
X*, the signal to be estimated, has additional properties. For 
example, in the context of CR studied in ||3], x* represents 
the power vector and it is by definition always nonnegative. 
In this case, a nonnegative constraint on Xk in © is needed; 


-(t) 

x\. = arg min 

Xk>0 


in |L(‘)(a;fc,x^*^^) -f ^c^k\xk - 


and the best-response in ® is simplified to 


■^k — 




, At) 

kxi.1. ~r Cj, 


k = l,...,K. 


^kk ^ '^k 

Furthermore, since both x*^*) and x*^*) are nonnegative, we have 

x(‘) + ^(xW - > 0, 0 < 7 < 1, 


and 

||xW + 7 (x(‘) - xW)||^ = + 7(4*^ - 4'4 

k^l 

fe=i 

Therefore the standard minimization rule (fT^ can be adopted 
directly and the stepsize is accordingly given as 

(G^xW - bW -b p(*)l)'^(x(*) - x(‘))1 ^ 

(x(‘) — (x(*) — x(*)) g’ 

where 1 is a vector with all elements equal to 1. 



equation 

+ 

X 


min(a, b) 

<2 1 at 

+ K)/2 

iN + l){K'^ + K)/2 

— 

— 

<2 IN 

NK 

{N+T)K 

— 

— 

(2) 

2K 

K'-‘ + K 

— 

— 

(8) 

bK 

K 

K 

2K 

GU 

&K -2 

k'^TIk 

1 

2K + 2 


Table I 

Computational Complexity of Algorithm|T] 


fusion center. Note that and defined in © can be 
updated recursively; 

n—1 

n—1 

After updating x according to ® and (fTbl l. the fusion center 
broadcasts x(*+^) G to all sensors. 

We discuss next the computational complexity of Algorithm 
[U Note that in (l2lTi . the normalization by t is not really compu¬ 
tationally necessary because they appear in both the numerator 
and denominator and thus cancel each other in the division in 
© and (fTbl l. Computing (I21al) requires {N + l)(iT^ + K)/2 
multiplications and (AT^ -b K)/2 additions. To perform (I21bl) . 
{N -b l)Ar multiplications and NK additions are needed. 
Associated with the computation of in I© are + K 
multiplications and 2K additions. Then in ®, 5K additions, 
K multiplications and K divisions are required. The projection 
in also requires 2K number comparisons. To compute 
(O, first note that G^^^x^*) — b^*) can be recovered from 
because G^^^x^*) — b^*^ = d(G(*^)ox^*) — and computing 
||x||^ requires K number comparisons and AT — 1 additions, 
so what are requested in total are -b dAT multiplications, 
6 AT — 2 additions, 1 addition, and 2 AT -b 2 number comparisons 
(the projection needs at most 2 number comparisons). The 
above analysis is summarized in Table H] and one can see that 
the complexity is at the order of K^, which is as same as 
traditional RFS El Ch. 14]. 

Network without a fusion center: In this case, the compu¬ 
tational tasks are evenly distributed among the sensors and 
the computation of © and (fTbl l is performed locally by each 
sensor at the price of (limited) signaling exchange among 
different sensors. 

We first define the following sensor-specihc variables Gn^ 
and bl*^ for sensor n as follows; 


(21a) 

(21b) 


B. Implementation issues and complexity analysis 

Algorithm [T] can be implemented in both a centralized and 
a distributed network architecture. To ease the exposition, we 
discuss the implementation issues in the context of WSN with 
a total number of N sensors. The discussion for CR is similar 
and thus not duplicated here. 

Network with a fusion center: The fusion center performs 
the computation of © and (fTbl l. To do this, the signaling 
from sensors to the fusion center is required; at each time 
instance t, each sensor n sends (gn\j/n ) ^ to the 


G(*) a 


r—1 


’(S: 


1 


andb(*) = - 2 ^y, 

T—1 


^..it)„{t) 

071 5 


so that G(*) = J2u=i and b(*) = J2u=i Note that 
G^*^ and b^*^ can be computed locally by sensor n and no 
signaling exchange is required. It is also easy to verify that, 
similar to dlB, gI*^ and bi*^ can be updated recursively by 
sensor n, so the sensors do not have to store all past data. 

The message passing among sensors in carried out in two 
phases. Firstly, for sensor n, to perform ©, d(G*^*^) and 
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are required, and they can be decomposed as follows: 


N 


d(GW) = ^d(GW)GM^, 

n—1 

(22a) 

G(*)x(*) - G 

n—1 

(22b) 

Furthermore, to determine the stepsize as in (fTbll and to 
compare with 0, the following variables are 

required at sensor n: 

N 

GP)x(‘) = Y G^x^*) G R^ 

n—1 

(22c) 

N 

G(*)x(‘) = Y G R^, 

(22d) 


n—1 


and 



(22e) 


but computing (I22el i does not require any additional signaling 
since is already available from (\22hi . 

Note that L^*'> can be computed from ( I22bl i-( l22dl i 

because 

2,(‘)(iW(^(t))) = i(i(b(^(t)))TG(t)i(t)(^W) 

= - 2 b(‘)) 

= i(x(*)(7(‘)))^(2(G(*)x(‘) - b(*)) - 

+ 7 (i)G(‘)(x(*) - x(*))) + ^(*)||x(*)(7(‘))||^, 

where G^^^x^*) —b^*) comes from (I22bl) . G^*^x(*) comes from 
(I22cl) . and G^*^(x(‘^ — x^*)) comes from (I22cl l- (l22dl l. 

To summarize, in the first phase, each node needs 
to exchange (d(Gn^), Gn^x^*) — bi*^) G while 

in the second phase, the sensors need to exchange 
(Gi*^x(*), gI^^x^*^) G thus the total signaling at 

each time instance is a vector of the size 4iT. The signaling 
exchange can be implemented in a distributed manner by, 
for example, consensus algorithms, which converge if the 
graph representing the links among the sensors is connected. 
A detailed discussion, however, is beyond the scope of this 
paper, and interested readers are referred to |[I1] for a more 
comprehensive introduction. 

Now we compare Algorithm [1] with state-of-the-art dis¬ 
tributed algorithms in terms of signaling exchange. 

1) The signaling exchange of Algorithm[T]is much less than 
that in ill]. In |[ll Alg. A.5], problem (|6]l is solved completely 
for each time instance, so it is essentially a double layer 
algorithm: in the inner layer, an iterative algorithm is used 


to solve © while in the outer layer t is increased to f -I- 1 
and © is solved again. In each iteration of the inner layer, a 
vector of the size 2K is exchanged among the sensors, and 
this is repeated until the termination of the inner layer (which 
typically takes many iterations), leading to a much heavier 
signaling burden than the proposed algorithm. 

2) A distributed implementation of the online sequential 
algorithm in Cl is also proposed in fl. It is a double layer 
algorithm, and in each iteration of the inner layer, a vector with 
the same order of size as Algorithm [T] is exchanged among 
the sensors. Similar to CIj this has to be repeated until the 
convergence of the inner layer. 

We also remark that in consensus-based distributed algo¬ 
rithms imci], the estimate decision of each sensor depends 
mainly on its own local information and those local estimates 
maintained by different sensors are usually different, which 
may lead to conflicting interests of the PUs (i.e., some may 
correctly detect the presence of PUs but some may not, so the 
PUs may still be interfered); an agreement (i.e., convergence) 
is reached only when t goes to infinity ll28ll . By comparison, 
in the proposed algorithm, all sensors update the estimate 
according to the same expression © and (fTbl l based on the 
information jointly collected by all sensors, so they have the 
same estimate of the unknown variable all the time and the 
QoS of PUs are better protected. 


C. Time- and norm-weighted sparsity regularization 

For a given vector x, its support 5x is defined as the set of 
indices of nonzero elements: 


S^ = {l<k<K iXkj^O}. 


Suppose without loss of generality iSx* = {1, 2,..., ||x*||q}, 
where ||xJL is the number of nonzero elements of x. It is 
shown in 11 1 ^ that the time-weighted sparsity regularization © 
does not make satisfy the so-called “oracle properties”, 
which consist of support consistency, i.e.. 


lim Prob 

t—¥oo 




= 1 , 


and v7-estimation consistency, i.e., 

^Kaslo,l:||x*||o “ ^l:||x*||o) A^jO, (T^ Gi: ||x* ||^ ,1: ||x* ), 

where -^d means convergence in distribution and Gi-.k^i-.k G 
jg jjjg upper left block of G. 

To make the estimation satisfy the oracle properties, it was 
suggested in iQ that a time- and norm-weighted lasso be 
used, and the loss function L^*\x) in © be modified as 
follows: 


.. t N 

LW(x) = -^^(z/M-(gM)^x)2 

r—1 n—1 

K 

+ (23) 

fe=l 

where: 

• x^*^ is given in ©; 
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• limj^oo = 0 and limt_>oo Vi ■ = oo, so 

must decrease slower than 1 /\/i; 

• The weight factor Wfj,{x) is defined as 

{ 1, if a: < /i, 

0 , if a; > afi, 

where a > 1 is a given constant. Therefore, the value ol 
the weight function in (| 2 ^ depends on 

the relative magnitude of x[l^i.. 

After replacing the universal sparsity regularization gain 
for element Xk in (l 8 ]l and (fT^ by W^{t) (|a;^^i*\|), Algorithm[T] 
can readily be applied to estimate x* based on the time- and 
norm-weighted loss function (l2Tt and the strong consistency 
holds as well. To see this, we only need to verify the nonin¬ 
creasing property of the weight function ^ 1 ). 

We remark that when t is sufficiently large, it is either 
= 0 or This is 

because limt_>oo = x* under the conditions of Lemma [T] 
If > 0 , since limj^oo = 0 > we have for any arbitrarily 
small e > 0 some to that < x^ — e for all t > to', 

the weight factor in this case is 0 for all t > to, and the 
nonincreasing property is automatically satished. If, on the 
other hand, = 0 , then x^j*^ converges to x^ = 0 at a speed 
of l/v/f i^ . Since decreases slower than IjVi, we have 
for some to such that for all t > to', in this case, 

W^(t) {x^l^f.) is equal to 1 and the weight factor is simply 
for all t > to, which is nonincreasing. 


D. Recursive estimation of time-varying signals 

If the signal to be estimated is time-varying, the loss 
function Q needs to be modified in a way such that the new 
measurement samples are given more weight than the old ones. 
Defining the so-called “forgetting factor” 0, where 0</3<l, 
the new loss function is given as follows lElHEl]: 


N t 

n—lT—1 

and as expected, when 0 = 1, (l24li is as same as (|5]l. In this 
case, the only modihcation to Algorithm [T] is that and 
b(*) are updated according to the following recursive rule: 

gw = U{t- i)^G(‘-i) -f E 

\ n^l 

b(‘) = i (^(i_l)/3b(‘-i) + Ey(‘)g(‘) 

V, n=l 

For problem (l24l l. since the signal to be estimated is time- 
varying, the convergence analysis in Theorem |4] does not hold 
any more. However, simulation results show there is little loss 
of optimality when optimizing (l24l i only approximately by 
Algorithm [1] This establishes the superiority of the proposed 
algorithm over the distributed algorithm in (U which solves 
(l24l i exactly at the price of a large delay and a large signaling 





Figure 1. Convergence behavior in terms of objective function value. 


burden. Besides, despite the lack of theoretical analysis. Al¬ 
gorithm [T] performs better than the online sequential algorithm 
il4|] numerically, cf. Figure | 6 ] in Section |V] 


V. Numerical Results 


In this section, the desirable features of the proposed 
algorithm are illustrated numerically. 

We first test the convergence behavior of Algorithm [T] 
with the online sequential algorithm proposed in 11411 . In this 
example, the parameters are selected as follows: 


■ V = 1 , so the subscript n is omitted. 

• the dimension of x*: K = 100; 

• the density of x*: 0 . 1 ; 

> Both g and v are generated by i.i.d. standard normal 
distributions: g G CM(0,I) and v G CAf(0,0.2); 

• The sparsity regularization gain gVl = VKjt = 10/t; 

• Unless otherwise stated, the simulations results are aver¬ 
aged over 100 realizations. 


We plot in Figure [T] the relative error in objective value 
—L(*)(X[E))/T^*H^i*'aslo) versus the time instance 


t, where 1) is defined in (| 6 ]l and calculated by MOSEK 
03311 : 2 ) x^*) is returned by Algorithm [T] in the proposed online 


parallel algorithm; 3) is returned by m Algorithm 1] 
in online sequential algorithm; and 4) x^^^ = 0 for both 
parallel and sequential algorithms. Note that V'^\Vi2sq) 
by dehnition the lower bound of V*'>(x) and V*'>(V*>) — 


L^*^(xjE) > 0 for all t. From Figure [T] it is clear that the 
proposed algorithm (black curve) converges to a precision of 
10 “^ in less than 200 instances while the sequential algorithm 
(blue curve) needs more than 800 instances. The improvement 
in convergence speed is thus notable. If the precision is set 
as 10 “"^, the sequential algorithm does not even converge in 
a reasonable number of instances. Therefore, the proposed 
online parallel algorithm outperforms in both convergence 
speed and solution quality. 

We also evaluate in Figure [T] the performance loss incurred 
by the simplihed minimization rule ([I5]l (indicated by the 
black curve) compared with the standard minimization rule 
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= 1/t 



time instance 


Figure 2. Convergence behavior in terms of relative square error. 


Figure 4. Stepsize error of the simplified minimization rule. 



Figure 3. Comparison of original signal and estimated signal. 




Figure 5. Weight factor in time- and norm-weighted sparsity regularization. 


(fT2l l (indicated by the red curve). It is easy to see from Figure 
[T]that these two curves almost coincide with each other, so the 
extent to which the simplified minimization rule decreases the 
objective function is nearly as same as standard minimization 
rule and the performance loss is negligible. 

Then we consider in Figure |2] the relative square error 
||x(‘) — X* II 2 / ||x*||2 versus the time instance t, where the 
benchmark is — x*]]^/ ||x*|| 2 , i.e., the recursive Lasso 

©. To compare the estimation approaches with and without 
sparsity regularization, RLS in (O is also implemented, where 
a regularization term (10“'*/f) • ||x ||2 is included into (| 6 ]l 
when is singular. We see that the relative square error 
of the proposed online parallel algorithm quickly reaches the 
benchmark (recursive lasso) in about 100 instances, while the 
sequential algorithm needs about 800 instances. The improve¬ 
ment in convergence speed is consolidated again. Another 
notable feature is that, the relative square error of the proposed 
algorithm is always decreasing, even in beginning instances, 
while the relative square error of the sequential algorithm is 


not: in the first 100 instances, the relative square error is 
actually increasing. Recall the signal dimension {K = 100), 
we infer that the relative square error starts to decrease only 
after each element has been updated once. What is more, 
estimation with sparsity regularization performs better than the 
classic RLS approach because they exploit the a prior sparsity 
of the to-be-estimated signal x*. The precision of the estimated 
signal by the proposed online parallel algorithm (after 1000 
time instances) is also shown element-wise in Figure [3 from 
which we observe that both the zeros and the nonzero elements 
of the original signal x* are estimated accurately. 

We now compare the proposed simplified minimization rule 
(cf. dnil-dlSll), coined as stepsize_simplified, with the standard 
minimization rule (cf. (fTSl i). coined as stepsize_optimal, in 
terms of the stepsize error defined as follows: 

stepsize_simplified-stepsize_optimal ^ 00 /^ 
stepsize_optimal 

In addition to the above parameter where = 10/f (in the 
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™ 10 


— online sequential algorithm (state-of-the-art) 

-online parallel algorithm (proposed) 

.reoursive lasso (benchmark) 



400 600 

time instance 


1000 


Figure 6. Relative square error for recursive estimation of time-varying 
signals 


lower subplot), we also simulate the case when = 1 /i 
(in the upper subplot). We see from Figure |4] that the stepsize 
error is reasonably small, namely, mainly in the interval [- 
5%,5%], while only a few of beginning instances are outside 
this region, so the simplified minimization rule achieves a good 
trade-off between performance and complexity. Comparing the 
two subplots, we find that, as expected, the stepsize error 
depends on the value of p. We can also infer that the simplified 
minimization rule tends to overestimate the optimal stepsize. 

In Figure |5] we simulate the weight factor 
versus the time instance t in time- and norm-weighted sparsity 
regularization, where fc = 1 in the upper plot and A: = 11 
in the lower plot. The parameters are as same as the above 
examples, except that = l/t°-^ and x* are generated such 
that the first 0.1 x iC elements (where 0.1 is the density of 
X*) are nonzero while all other elements are zero. The weight 
factors of other elements are omitted because they exhibit 
similar behavior as the ones plotted in Figure |5] As analyzed, 
>V^(t) the weight factor of the first element, where 

x\ ^ 0 , quickly converges to 0 , while W^(t) |), the 

weight factor of the eleventh element, where a;^ = 0 , quickly 
converges to 1 , making the overall weight factor monotonically 
decreasing, cf. (|2^ . Therefore the proposed algorithm can 
readily be applied to the recursive estimation of sparse signals 
with time- and norm-weighted regularization. 

When the signal to be estimated is time-varying, the theoret¬ 
ical analysis of the proposed algorithm is not valid anymore, 
but we can test numerically how the proposed algorithm 
performs compared with the online sequential algorithm. The 
time-varying unknown signal is denoted as xj, and it is 
changing according to the following law: 

^t+l,k ~ ^^t,k + 

where wt,k ^ CAf{0,1 — a^) for any k such that ^ 7 ^ 0, 
with a = 0.99 and /3 = 0.9. In Figure |6l the relative 
square error ||X( — x *||2 / ||xj ||2 is plotted versus the time 
instance. Despite the lack of theoretical consolidation, we 


observe the online parallel algorithm is almost as same as 
the pseudo-online algorithm, so the inexact optimization is 
not an impeding factor for the estimation accuracy. This also 
consolidates the superiority of the proposed algorithm over 
ifltl where a distributed iterative algorithm is employed to 
solve (I24I 1 exactly, which inevitably incurs a large delay and 
extensive signaling. 

VI. Concluding Remarks 
In this paper, we have considered the recursive estimation 
of sparse signals and proposed an online parallel algorithm 
with provable convergence. The algorithm is based on inexact 
optimization with an increasing accuracy and parallel update 
of all elements at each time instance, where both the update 
direction and the stepsize can be calculated in closed-form 
expressions. The proposed simplified minimization stepsize 
rule is well motivated and easily implementable, achieving 
a good trade-off between complexity and convergence speed, 
and avoiding the common drawbacks of the decreasing step- 
sizes used in literature. Simulation results consolidate the 
notable improvement in convergence speed over state-of-the- 
art techniques, and they also show that the loss in convergence 
speed compared with the full version (where the lasso problem 
is solved exactly at each time instance) is negligible. We 
have also considered numerically the recursive estimation of 
time-varying signals where theoretical convergence do not 
necessarily hold, and the proposed algorithm works better than 
state-of-the-art algorithms. 


Appendix A 
Prooe of Theorem|4] 

Proof: It is easy to see that L^^'> can be divided into a 
smooth part /(‘^(x) and a nonsmooth part 


/(*)(x)4VGWx-(b(*))^x, (25a) 

/t(*)(x)4;,W||x||,. (25b) 


We also use x^*)) to denote the smooth part of the 

objective function in ([ 8 ]): 






AO 


(x- 




(26) 


Functions /^*^(x;xA)) and /A)(x) are related according to 
the following equation: 


fk\xk-,y:^*'^) = (27) 


from which it is easy to infer that x^)) = 

Vfe/A)(xA)). Then from the first-order optimality condition, 
h^*\xk) has a subgradient S dh^*\x^^'^) at Xk = xf^ 
such that for any Xk'- 


(aife -f > 0, Vfc. (28) 


Now consider the following equation: 

L(0(x(‘+i)) - L(‘-i)(x(‘)) = 

L(0(x(‘+i)) - L(‘)(x(0) + l(‘)(x(‘)) _ lA-i)(x(‘)). 

(29) 
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The rest of the proof consists of three parts. Firstly we prove 
that there exists a constant p > 0 such that L^*'> — 

< —p ll^. Then we show that the 

sequence converges. Finally we prove that 

any limit point of the sequence is a solution of (| 2 |l. 

Part 1) Since > c > 0 for all t is defined in 
Proposition |2|i from Assumption (A5), it is easy to see from 
(fTTb that the following is true: 


2,(t)(xW +^(xW _x(*))) -lW(x(*)) 

< -l{c- iA^ax(GW) 7 ) ||x(‘) - x(‘) II 0 < 7 < 1 . 


Since A„iax(«) is a continuous function 11341] and con¬ 
verges to a positive definite matrix by Assumption (Al), there 
exists a A < +oo such that A > A„iax(G^‘i) for all t. We thus 
conclude from the preceding inequality that for all 0 < A < 1: 


2,W(xW + ^(xW _ x(‘i)) - 

< — 7 ||x^*^ — x^*^ll^. (30) 

It follows from (fT4l) . (fTsT i and (l30l l that 

7^(t)(x(‘+i)) 

< -f 7 (*)(x(‘) -x(*))) 

-f (1 - + 7(‘iM‘i(x(‘)) (31) 

< -f 7(x(*) -x(*i)) 

+ (1 - 7)/j(‘) (x(*)) + 7/i(‘) (x(‘)) (32) 

<L(‘i(x(*))-7(c- iA7)||x(‘i -x(*)||J. (33) 

Since the inequalities in ( l33l l are true for any 0 < 7 < 1, we 
set 7 = min(c/A, 1). Then it is possible to show that there is 
a constant p > 0 such that 


2 ,(‘)(x(‘+i)) - lW(x(‘)) < 

<-?7||x(*)-x(‘i|| 2 . (34) 

Besides, because of Step 3 in Algorithm [1] x(‘+^i is in the 
following lower level set of L(*)(x): 

4‘^^{x:L«(x)<0}. (35) 

Because ||x||j^ > 0 for any x, dTSl) is a subset of 

|x : ix^G(*)x - (b(‘))^x < o|, 

which is a subset of 

^<0 - : ^Amax(G(*)) ||x ||2 - (b(‘))^x < o|. (36) 

Since G^*^ and b*^*) converges and limj^oo 0, there 

exists a bounded set, denoted as £<o, such that £<g C C 
£<o for all t; thus the sequence is bounded and we 

denote its upper bound as x. 


Part 2) Combining (l29l l and (l34l i. we have the following: 

^(t+l)(x(*+2))_L(*)(x(*+l)) 

< (37) 

where the last inequality comes from the decreasing property 
of by Assumption (A 6 ). Recalling the definition of /*^*)(x) 
in dSll, it is easy to see that 

{t + l)(/(‘+i)(x(*+i)) - /W(x(‘+i))) 

= --Y, 


where 


N 




Taking the expectation of the preceding equation with 
respect to {yn^^\gn^^'^}n=i^ conditioned on the natural 
history up to time f + 1 , denoted as 

jr{t+l) ^ 

|x(°),... ,x(*+i), {gl°\..., {yi°\.. • , 

we have 


E 


= E 


= E 


{t + l)(/(‘+i)(x(*+i)) - 


^(t+i)(x(*+i))|J-(‘+i)j - -^E 


T—1 

t 




(38) 


T—1 


where the second equality comes from the observation that 
^('r)(x(*+i)) is deterministic as long as is given. This 

together with (iJTl l indicates that 


E 

< E 

< 




1 


< 


t 1 

1 

t + 1 


E 


E 




r=l 

t 


;(t+i)(x(*+i))|j-P+i) _ _^;M(x(‘+i)) 


and 


< 




J 0 


t 1 


E 


;(t+i)(xP+i))|j-(t+i)l _ i^;M(x(*+i)) 


< 


- 7 sup 

f + 1 xex 


T — 1 
t 


E 




, (39) 


where [a;]o = max(a;, 0), and X in ( |39] ) with X = 
x*^^\ • ■ ■, } is the complete path of x. 
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Now we derive an upper bound on the expected value of 
the right hand side of ( l39l i: 

t 


E 

= E 
< E 
= E 

where 


sup 


E 




sup|y«-(r(‘)fx + x'''R^''x| 


t 






sup|y^‘^| + sup|(b^*^)^x| + sup |x’^G^‘^x| 
xG^ xGX xG^ 


sup y 

xeA’ 


(t)| 


-E 


sup 

,xex 


+ E 


sup|x^G(‘)x 
xex 

(40) 


t N 




r—\n—1 

t N 

T—l n—1 
t N 


{y„.g„} [ynSn] - 

r)T'l 


Q(t) = i ^ ^ (^Eg„ [g„g„] - gi*)g(;) 

r—1n—1 

Then we bound each term in (l40l i individually. For the first 
term, since yO) is independent of xO), 


E 


sup 2 / 

xG-V 


(t)| 


= E 


\y 


(t)i 


= E 


{ymy 


< y/e[(z/(0)2] <J^ (41) 


t 

for some cti < oo, where the second equality comes from 
Jensen’s inequality. Because of Assumptions (Al), (A2) and 
(A4), i/O) has bounded moments and the existence of cti is 
then justified by the central limit theorem j^ . 

For the second term of (l40l) . we have 

\T 


E 


sup 


(b«)' 


<E 


sup(|b^*^ 1 )^ 


<(e |b^*^| ) 


X . 


Similar to the line of analysis of (HTt . there exists a (72 < oo 
such that 


E 


sup 




(e |b(*)| ) 


< E 


Ix| < 


(42) 


For the third term of (l40l t, we have 


sup X 

xGA' 


max |Afc(G(*))| • HxH^ 


7max{Ai,,(G(0),A^i„(G(0)} 



max{A^a^(G0)),A^in(G0))} 


= X 


Y^XUGW) 

k^l 

||^yE[tr(G(0(G(0)J’)' 


< 


(43) 


for some (73 < oo, where the first equality comes from the 
observation that x should align with the eigenvector associated 
with the eigenvalue with largest absolute value. Then combing 
(|4 T]i-(|43]i, we can claim that there exists a = + \f^ + 


' (j\ > 0 such that 


E 


sup 

xGA^ 


E 


((*+i)(x)|JF(*+i )1 _ 


< 




In view of ( l39l l, we have 


E 


E 


L(‘+i)(x(*+2)) - l(‘)(x(*+i))|J'(*+i) 

Summing (l44l i over t, we obtain 

OO 

^E E l(*+i)(x(‘+2)) - 


OJ 


< 


t=i 


f3/2- 

(44) 


< 00 . 


Then it follows from the quasi-martingale convergence the¬ 
orem (cf. S Th. 6 ]) that converges almost 

surely. 

Part 3) Combining ( |29] | and (l34l) . we have 
2 ,(t)(x(*+i))_L(‘-i)(xW) < 

-r;||x(‘) -xW|| 2 -fL(*)(x(‘))(45) 
Besides, it follows from the convergence of 

lim = 0 , 

t—^oo 

and the strong law of large numbers that 

lim lW(x(‘))-L(‘-i)(x(‘)) = 0 . 

t—¥oo 

Taking the limit inferior of both sides of (l45T l. we have 
0 = lim inf 

< liminf (-pllx^*) -x(‘)||^ -f L(*)(x(‘)) -L(‘-i)(x(‘))]. 

t—^OO L " ) 

< lim inf {—r]\\ x*-*^ — x*-*^ 11 ^ 1 

t-s-oo I " " 2 J 

-f lim sup I (x(*) ) - (x(‘) ) I 

t—¥oo ^ J 

= —rj ■ limsup||x^‘^ — x^*^ < 0 , 


so we can infer that limsup^. 


yt) — x(*) 11 = 0. Since 


0 < liminfi_i.oo — x^*) II < limsup^. 


f*)_xW|| = 
II 2 

0 , we can infer that liminft^oo||x(‘^ — x(‘^|| = 0 and thus 
limt^oo ||x^‘^ — x^*) 11 = 0 . 

Consider any limit point of the sequence {x^)}^, denoted 
as x*^°°^. Since x is a continuous function of x in view of ([ 8 ]l 
and limt_).oo ||x 0 ) — x 0 )||^ = 0 , it must be limt_>oo x^) = 
:^(oo) _ x(°°), and the minimum principle in (l28T l can be 
simplihed as 


(xfe - 4“^)(Vfe/(“)(x(°°)) + > 0, Wxk, 

summation over k = 1,..., K leads to 
(x-x(“))^(V/(°°)(x(°°)) -f > 0, Vx. 
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Therefore minimizes L^°°'>{x) and = x* almost 

surely by Lemma [T] Since x* is unique in view of Assump¬ 
tions (A1)-(A3), the whole sequence has a unique limit 

point and it thus converges to x*. The proof is thus completed. 
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